Abstract

We introduce a learning method called "gradient-based reinforcement planning" (GREP). 
Unlike traditional DP methods that improve their policy backwards in time, GREP is a 
gradient-based method that plans ahead and improves its policy before it actually acts in the 
environment. We derive formulas for the exact policy gradient that maximizes the expected 
future reward and confirm our ideas with numerical experiments. 



1 Introduction 

It has been shown that planning can dramatically improve convergence in reinforcement learning 
(RL) (?; ?). However, most proposed RL methods that explicitly use planning are
value-based (or Q-value-based) methods, such as Dyna-Q or prioritized sweeping.

Recently, much attention has been directed to so-called policy-gradient methods that improve their
policy directly by calculating the derivative of the future expected reward with respect to the
policy parameters. Gradient-based methods are believed to be more advantageous than
value-function-based methods in huge state spaces and in POMDP settings. Probably the first
gradient-based RL formulation is the class of REINFORCE algorithms of Williams (?). Other more
recent methods are, e.g., (?; ?; ?). Our approach of deriving the gradient has the flavor of (?),
who derive the gradient using future state probabilities.

Our novel contribution in this paper is to combine gradient-based learning with explicit
planning. We introduce "gradient-based reinforcement planning" (GREP), which improves a policy
before actually interacting with the environment. We derive the formulas for the exact policy
gradient and confirm our ideas in numerical experiments. GREP learns the action probabilities
of a probabilistic policy for discrete problems. While we illustrate GREP with a small MDP
maze, it may also be used for the hidden state in POMDPs.



*This is an extended version of the paper presented at EWRL 2001 in Utrecht (The Netherlands). In
this technical report, the derivation steps are presented in more detail, with more footnotes,
appendices and more (unfinished) ideas.
2 Derivation of the policy gradient

Projection matrix

Let us denote the discrete space of states by $S = \{1, \ldots, N\}$. Our belief on S is described by a
probability vector s of which each element $s_i$ represents the probability of being in state i. We
also define a set of actions $A = \{1, \ldots, K\}$. The stochastic policy $\mathcal{P}$ is represented by a matrix
$P : N \times K$ with elements $P_{ki} = p(a_k | s_i)$, i.e. the conditional probability of action k in state
i.¹ Furthermore, let environment $\mathcal{E}$ be defined by transition matrices $T_k$ ($k = 1, \ldots, K$) with elements
$T_{kji} = p(s_j | s_i, a_k)$, i.e. the transition probability to $s_j$ in state $s_i$ given action k.
Now we define the projection matrix F with elements

$$F_{ji} = \sum_k T_{kji} P_{ki}. \qquad (1)$$

Importantly, matrix F does not model the transition probabilities of the environment itself, but
the transition probabilities induced by policy $\mathcal{P}$ in environment $\mathcal{E}$. The induced transition
probability $F_{ji}$ is a weighted sum over actions k of the transition probabilities $T_{kji}$ with the policy
parameters $P_{ki}$ as the weights.
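
As an illustration of Eq. 1, the following minimal sketch builds F for a tiny hypothetical problem. The array layout is an assumption made here for convenience (it is not prescribed by the text): `T[k, j, i]` holds $T_{kji}$ and `P[k, i]` holds $P_{ki}$, so every column of F sums to one. The 3-state chain and the function name `projection_matrix` are likewise illustrative only.

```python
import numpy as np

def projection_matrix(T, P):
    """Induced transition matrix F[j, i] = sum_k T[k, j, i] * P[k, i] (Eq. 1)."""
    return np.einsum("kji,ki->ji", T, P)

# Hypothetical 3-state, 2-action chain: action 0 moves right, action 1 stays.
N, K = 3, 2
T = np.zeros((K, N, N))
for i in range(N):
    T[0, min(i + 1, N - 1), i] = 1.0  # "right" (the last state maps to itself)
    T[1, i, i] = 1.0                  # "stay"

P = np.full((K, N), 1.0 / K)          # uniform stochastic policy, P[k, i] = p(a_k | s_i)

F = projection_matrix(T, P)
print(F)
print(F.sum(axis=0))                  # each column of F sums to one
```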



Expected state occupancy 

Using the projection matrix F, states $s_t$ and $s_{t+1}$ are related as $s_{t+1} = F s_t$ and therefore $s_t = F^t s_0$,
where $s_0$ is the state probability distribution at t = 0. We can now define the expected state
occupancy as

$$z = E[s | s_0] = \sum_{t=0}^{\infty} \gamma^t s_t = \sum_{t=0}^{\infty} (\gamma F)^t s_0 = (I - \gamma F)^{-1} s_0 \qquad (2)$$

where $\gamma$ is a discount factor in order to keep the sum finite. In the last step, we have recognized
the sum as the Neumann representation of the inverse. Notice that z is a solution of the linear
equation

$$(I - \gamma F) z = s_0 \qquad (3)$$

which is just the familiar Bellman equation for the expected occupancy probability z.
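
The equivalence between the discounted sum and the linear solve can be checked numerically. The sketch below is illustrative only: the 3-state matrix F, the start distribution and $\gamma = 0.9$ are hypothetical values, and the function names are not from the paper. Because F is column-stochastic, the entries of z sum to $1/(1 - \gamma)$.

```python
import numpy as np

def expected_occupancy(F, s0, gamma):
    """Solve the linear system (I - gamma * F) z = s0 for z (Eq. 3)."""
    n = F.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * F, s0)

def occupancy_by_series(F, s0, gamma, n_terms=500):
    """Truncated Neumann series sum_t gamma^t F^t s0 (Eq. 2)."""
    z = np.zeros_like(s0)
    s_t = s0.copy()
    for t in range(n_terms):
        z += gamma**t * s_t
        s_t = F @ s_t          # propagate the belief one step: s_{t+1} = F s_t
    return z

# Hypothetical column-stochastic projection matrix and start distribution.
F = np.array([[0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 1.0]])
s0 = np.array([1.0, 0.0, 0.0])
gamma = 0.9

z = expected_occupancy(F, s0, gamma)
z_series = occupancy_by_series(F, s0, gamma)
print(z, z.sum())
```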

Expected reward function 

In reinforcement learning (RL) the objective is to maximize future reward. We define a reward 
vector r in the same domain as s. Using the expected occupancy z the future expected reward H 
is simply 

$$H = \langle r, z \rangle \qquad (4)$$

where $\langle \cdot, \cdot \rangle$ is the scalar vector product.²

Because z is a solution of Eq. 3, it is dependent on F, which in turn depends on policy P. Given
r and $s_0$, our task is to find the optimal $P^*$ such that H is maximized, i.e.

$$P^* = \arg\max_P H. \qquad (7)$$

1 For computational reasons, one often reparameterizes the policy using a Boltzmann distribution. In this
paper the probability $p(a_k | s_i)$ is just given by $P_{ki}$ and we do not use reparameterization in order to keep the
analysis clear.

2 In optimal control (OC) we want to reach some target state under some optimality conditions (mostly minimum
time or minimum energy). We denote r as our target distribution and denote the time-to-arrival as $t^*$. If $t^*$ were
known beforehand then we could minimize the error

$$H = \tfrac{1}{2} (r - s_{t^*})^2 = \tfrac{1}{2} (r - F^{t^*} s_0)^2. \qquad (5)$$
We can regard the calculation of the future expected reward as a composition of two operators

$$\mathcal{Q}: F \mapsto z \qquad (8)$$

which maps the transition matrix F to the expected occupancy probabilities z, and

$$\mathcal{R}: z \mapsto R \qquad (9)$$

which maps the probabilities z, given a reward distribution r, to an expected reward value R.

Calculation of the policy gradient 

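A variation $\delta z$ in the expected occupancy can be related to first order to a perturbation $\delta P$ in the
(stochastic) policy. To obtain the partial derivatives $\partial z / \partial P_{ki}$, we differentiate Eq. 3 with respect
to $P_{ki}$ and obtain:

$$-\gamma \frac{\partial F}{\partial P_{ki}} z + (I - \gamma F) \frac{\partial z}{\partial P_{ki}} = 0. \qquad (10)$$

The right-hand side of the equation is zero because we assume that $s_0$ is independent of $P_{ki}$.
Rearranging gives:

$$\frac{\partial z}{\partial P_{ki}} = \gamma K \frac{\partial F}{\partial P_{ki}} z \qquad (11)$$

where $K = (I - \gamma F)^{-1}$.

From Eq. 4 and Eq. 10, together with the chain rule, we obtain the gradient of the RL error
with respect to the policy parameters $P_{ki}$:

$$\frac{\partial H}{\partial P_{ki}} = \left\langle \frac{\partial H}{\partial z}, \frac{\partial z}{\partial P_{ki}} \right\rangle = \left\langle r, \gamma K \frac{\partial F}{\partial P_{ki}} z \right\rangle = \gamma \left\langle K^* r, \frac{\partial F}{\partial P_{ki}} z \right\rangle \qquad (12)$$

where $A^*$ means the adjoint operator of A defined by $\langle u, A w \rangle = \langle A^* u, w \rangle$. Let us define $q = K^* r$.
While K maps the initial state $s_0$ to the future expected state occupancy z, its adjoint, $K^*$, maps
the reward vector r back to expected reward q. The value of $q_i$ represents the (pointwise) expected
reward in state $s_i$ for policy P.³

Finally, differentiating Eq. 1 gives us $\partial F / \partial P_{ki}$: since $\partial F_{ji'} / \partial P_{ki} = T_{kji} \delta_{ii'}$, the vector
$(\partial F / \partial P_{ki}) z$ has elements $T_{kji} z_i$. Inserting this into Eq. 12 yields:

$$G_{ki} = \frac{\partial H}{\partial P_{ki}} \propto z_i \sum_j T_{kji} q_j. \qquad (13)$$

In words, the gradient of H with respect to policy parameter $P_{ki}$ (i.e. the probability of taking
action $a_k$ in state $s_i$) is proportional to the expected occupancy $z_i$ times the sum of
expected reward $q_j$ over next states ($j = 1, \ldots, N$), weighted by the transition probabilities $T_{kji}$.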
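
The gradient formula can be checked against central finite differences. The sketch below is not the authors' implementation; it assumes the array layout `T[k, j, i]` and `P[k, i]`, uses a small random test problem (hypothetical), keeps the proportionality factor $\gamma$ explicit, and treats the entries of P as independent, unconstrained parameters. The adjoint $K^*$ is taken as the matrix transpose, which holds for the standard inner product.

```python
import numpy as np

def expected_reward(T, P, s0, r, gamma):
    """H = <r, z> with z from (I - gamma F) z = s0 (Eqs. 1, 3, 4)."""
    F = np.einsum("kji,ki->ji", T, P)
    z = np.linalg.solve(np.eye(F.shape[0]) - gamma * F, s0)
    return r @ z

def policy_gradient(T, P, s0, r, gamma):
    """Adjoint gradient G[k, i] = gamma * z[i] * sum_j T[k, j, i] * q[j]."""
    F = np.einsum("kji,ki->ji", T, P)
    A = np.eye(F.shape[0]) - gamma * F
    z = np.linalg.solve(A, s0)        # forward solve: expected occupancy
    q = np.linalg.solve(A.T, r)       # adjoint solve: q = K^T r
    return gamma * z[None, :] * np.einsum("kji,j->ki", T, q)

# Hypothetical random test problem.
rng = np.random.default_rng(0)
K, N, gamma = 2, 4, 0.9
T = rng.random((K, N, N))
T /= T.sum(axis=1, keepdims=True)     # normalize over next states j
P = rng.random((K, N))
P /= P.sum(axis=0, keepdims=True)     # normalize over actions k
s0 = np.eye(N)[0]
r = rng.random(N)

G = policy_gradient(T, P, s0, r, gamma)

# Central finite differences: two evaluations of H per parameter.
eps = 1e-5
G_fd = np.zeros_like(G)
for k in range(K):
    for i in range(N):
        dP = np.zeros_like(P)
        dP[k, i] = eps
        H_plus = expected_reward(T, P + dP, s0, r, gamma)
        H_minus = expected_reward(T, P - dP, s0, r, gamma)
        G_fd[k, i] = (H_plus - H_minus) / (2 * eps)

print(np.max(np.abs(G - G_fd)))
```

The adjoint version needs two linear solves in total, whereas the finite-difference check repeats the solve for every parameter.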

However, the exact arrival time is mostly not known beforehand; the most we can do is to minimize the
(time-weighted) expected error

$$H = \sum_{t^*=0}^{\infty} \gamma^{t^*} \tfrac{1}{2} (r - F^{t^*} s_0)^2 = \frac{1}{2(1 - \gamma)} r^2 - \langle r, z \rangle + \tfrac{1}{2} \sum_{t^*=0}^{\infty} \gamma^{t^*} (F^{t^*} s_0)^2. \qquad (6)$$

The first term on the right-hand side is often not relevant because it is independent of F. When we compare Eq. 6
with Eq. 4, we see that the OC error function has a quadratic term in F which the RL error lacks.

3 Indeed, this is a different way to define the traditional value function. Note that generally, neither r nor q are
probabilities because their 1-norm is generally not 1.
Note that the gradient could also have been approximated using finite differences, which would
need at least $1 + n^2$ field calculations.⁴ The adjoint method is much more efficient and needs only
two field calculations.

Once we have the gradient G, improving policy P is straightforward using gradient
ascent, or we can use more sophisticated gradient-based methods such as nonlinear conjugate
gradients (as in (?)). The optimization is nonlinear because z and q themselves depend on the
current estimate of P.

3 Computation of the optimal policy 

We will introduce two algorithms that incorporate our ideas of gradient-based reinforcement
planning. The first is an offline planning algorithm that finds the optimal policy
but assumes that the environment transition probabilities are known. The second is an
online version that could cope with unknown environments.

3.1 Offline GREP 

If the environment transition probabilities $T_{kji}$ are known, the agent may improve its policy using
GREP. Our offline GREP planning algorithm consists of two steps:

1. Plan ahead: Compute the policy gradient G in Eq. 13 and improve the current policy

$$P \leftarrow P + \alpha G \qquad (14)$$

where $\alpha$ is a suitable step size parameter; for efficiency we can also perform a line search on
$\alpha$.

2. Evaluate policy: Repeat the above until the policy is optimal.

Matrix P describes a probabilistic policy. We define the maximum probable policy (MPP) to
be the deterministic policy obtained by taking the most probable action at each state. It is not obvious
that the MPP will converge to the globally optimal solution, but we expect the MPP at least to
be near-optimal.
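
The two offline steps and the MPP extraction can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the 5-state corridor, the fixed step size `alpha`, the fixed iteration count, and in particular the clip-and-renormalize step that keeps each column of P a probability distribution are all choices made here.

```python
import numpy as np

def grep_offline(T, s0, r, gamma=0.9, alpha=0.1, n_iter=200):
    """Gradient ascent on P (Eq. 14) followed by extraction of the MPP."""
    K, N, _ = T.shape
    P = np.full((K, N), 1.0 / K)                  # uniform initial policy
    for _ in range(n_iter):
        F = np.einsum("kji,ki->ji", T, P)         # Eq. 1
        A = np.eye(N) - gamma * F
        z = np.linalg.solve(A, s0)                # expected occupancy
        q = np.linalg.solve(A.T, r)               # adjoint expected reward
        G = gamma * z[None, :] * np.einsum("kji,j->ki", T, q)  # Eq. 13
        P = P + alpha * G                         # Eq. 14
        # Assumption: clip and renormalize so that P[:, i] stays a distribution.
        P = np.clip(P, 1e-12, None)
        P /= P.sum(axis=0, keepdims=True)
    return P, P.argmax(axis=0)                    # probabilistic policy, MPP

# Hypothetical corridor: action 0 = left, action 1 = right, reward in the last state.
N = 5
T = np.zeros((2, N, N))
for i in range(N):
    T[0, max(i - 1, 0), i] = 1.0
    T[1, min(i + 1, N - 1), i] = 1.0
s0 = np.eye(N)[0]
r = np.eye(N)[-1]

P, mpp = grep_offline(T, s0, r)
print(P.round(3))
print(mpp)
```

In this toy corridor the MPP selects "right" in every state, while the probabilistic policy itself remains stochastic.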

Numerical experiments 

We performed some numerical experiments using offline GREP. Our test problem was a pure
planning task in a 10 x 10 toy maze (see Fig. 1) where the probabilistic policy P represents the
probability of taking a certain action at a certain maze position. The same figure also shows
typical solutions for the quantities z and q, i.e. the expected occupancy and expected reward
respectively (for a certain P).

After each GREP iteration, i.e. after each gradient calculation and P update, we checked the
obtained policy by running 20 simulations using the current value of P. The probability weighted
(PW) policy selects action k at state i with probability proportional to $P_{ki}$, while the annealed PW policy uses
an annealing factor of T = 4; we also simulated the MPP solution. Figure 2 shows the average
simulated path length versus GREP iteration of the PW policy, the annealed PW policy and the derived
MPP policy. In the left plot the initial policy P was taken uniform. The right plot in the same

4 Finite difference approximation of the derivative $\partial z / \partial F_{ij}$ involves computing z for F, then perturbing a
single $F_{ij}$ in F by a tiny amount $\delta F$ and subsequently recomputing $z'$. The derivative is then approximated by
$\partial z / \partial F_{ij} \approx (z' - z) / \delta F$. For an $n \times n$ matrix F, one would need to repeat this for every element, requiring
a total of up to $1 + n^2$ calculations of z.
Figure 1: Left: 10 x 10 toy maze with start at the left and goal at the right side. Center: plot of expected
occupancy z. Right: plot of expected reward q. White corresponds to higher probability. [Blurring
is due to visualisation only.]

Figure 2: Plot of simulated path length versus GREP iteration of a small toy MDP maze for the
probability weighted (PW) policy, annealed PW policy and MPP policy. The shortest path to the
goal is 14. Left: starting from an initial uniform policy. Right: starting from an initial random policy.

figure shows the simulated path lengths from a random policy; also here the MPP finds the optimal
solution but slightly later.

We see from the figure that in both cases the probability-weighted (PW) policy improves
during the GREP iterations. However, the convergence is very slow, which shows the severe
nonlinearity of the problem. The annealed PW policy does perform better than PW. Finally, we see that
MPP finds the optimal solution within a few iterations. Using Dijkstra's method, we
confirmed that the MPP policy found was in agreement with the global shortest path solution.



3.2 Online GREP

The account below describes an idea to use GREP when the environment is not known beforehand.⁵
The steps interleave "Kalman filter"-like estimation of the unknown environment
transition probabilities with the explicit planning of GREP. In fact, it also includes a step to estimate
a possibly unknown (linear) sensor mapping. Apart from the policy matrix P, we also need to
estimate the (environment) transition probabilities $T_{kji}$ and possibly the sensor matrix B. We can
optimize for all parameters by iteratively ascending to their conditional mode. The conditional
maximizing steps are easy:

1. Plan ahead: Compute the policy gradient G in Eq. 13 and improve the current policy

$$P \leftarrow P + \alpha G \qquad (15)$$

where $\alpha$ is a suitable step size parameter; for efficiency we can also perform a line search.
Afterwards, we need to renormalize the columns of P. See note below on policy regularization.

2. Select action: Given state estimate $s_t$, draw an action k from the policy according to

$$k \sim P^T s_t$$

and receive reward R and estimate the new state $s_{t+1}$.

3. Estimate state: Observe $y_{t+1}$ and estimate $s_{t+1}$ using

$$p(s_{t+1} | y_{t+1}, s_t, k) \propto p(y_{t+1} | s_{t+1}) \, p(s_{t+1} | F_k, s_t). \qquad (16)$$

Assuming Gaussian noise for observations and state estimates we obtain:

$$s_{t+1} = \frac{H_s F_k s_t + B^{-1} H_y y_{t+1}}{H_s + B^{-1} H_y B^{-T}} \qquad (17)$$

where B is the sensor matrix that maps internal state s to observations y. Matrices $H_s$
and $H_y$ are the inverse covariances (so-called precisions) of state s and observation y
respectively.

4. Estimate sensors: In case the sensor matrix B is also unknown, we have to perform an additional
estimation step for B. This is a common step in the standard Kalman formulation.

5. Estimate environment: Given action k and reward R, we update the reward vector

$$r_j \leftarrow R \qquad (18)$$

and the environment transition probabilities

$$\delta F_k \propto (s_{t+1} - F_k s_t) s_t^T (s_t s_t^T)^{-1}, \qquad (19)$$

or if $(s_t s_t^T)^{-1}$ does not exist we can use

$$\delta F_k \propto (s_{t+1} - F_k s_t) (s_t^T s_t)^{-1} s_t^T, \qquad (20)$$

where $s_t = j$, and reestimate the environment transition probabilities

$$T_k \leftarrow T_k + (s_{t+1} - T_k s_t) s_t^T. \qquad (21)$$

After the update, one should set entries in $F_k$ that correspond to physically impossible
transitions to zero. Afterwards, we need to renormalize the columns of $T_k$. It is important to
note that, given $s_t$ and $s_{t+1}$, the transition matrix is conditionally independent of the policy
P. That is to say, we can obtain an accurate model of the environment using, e.g., just a
random walk.

6. Repeat from step 1.

5 At the time of writing, we have not implemented this idea yet.

To draw a picture of what is happening: in the planning stage, based on the current (and
maybe inaccurate) environment model, the agent tries to improve its current policy by planning
ahead using the gradient in Eq. 13. Remember that the gradient involves simulating paths from
the current state and adjoint paths from the goal. In the action stage the agent samples an action
from its policy. Then the agent senses the new state and updates its environment model using
this new information. Notice that policy improvement is not done "backwards", as is traditionally
done in DP methods, but "forward" by planning ahead.
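
The environment re-estimation step (Eq. 21, followed by zeroing impossible transitions and renormalizing columns) might look as follows. Since the online algorithm was not implemented by the authors, this is only a sketch: the step size `lr`, the boolean mask `allowed`, the clipping of negative entries and the function name are hypothetical additions. With `lr=1.0` the update reduces to Eq. 21 as written.

```python
import numpy as np

def update_transition_model(T_k, s_t, s_next, allowed, lr=1.0):
    """One re-estimation step for the transition matrix of action k (Eq. 21)."""
    # Rank-one correction of the prediction error (s_next - T_k s_t).
    T_new = T_k + lr * np.outer(s_next - T_k @ s_t, s_t)
    # Set physically impossible transitions to zero (and clip negatives; assumption).
    T_new = np.where(allowed, np.clip(T_new, 0.0, None), 0.0)
    # Renormalize the columns so that each remains a probability distribution.
    col_sums = T_new.sum(axis=0, keepdims=True)
    return T_new / np.where(col_sums > 0, col_sums, 1.0)

# Hypothetical 3-state example: only moves to neighbouring states are possible.
N = 3
idx = np.arange(N)
allowed = np.abs(idx[:, None] - idx[None, :]) <= 1
T_k = np.full((N, N), 1.0 / N)        # uninformed initial model
s_t = np.eye(N)[0]                    # agent was in state 0 ...
s_next = np.eye(N)[1]                 # ... and is observed in state 1

T_k = update_transition_model(T_k, s_t, s_next, allowed, lr=0.5)
print(T_k.round(3))
```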

4 Conclusions 

Future topics 

We have tacitly assumed that z and q are computed using the same discount factor $\gamma$. However,
we could introduce separate parameters $\gamma_z$ and $\gamma_q$, which effectively assign a different "forward
time window" for z and a "backward time window" for q. In fact, when $\gamma_z \to 0$ we have a
"one-step-look-ahead". Alternatively, in the limit of $\gamma_q \to 0$ we obtain a gradient for a greedy policy
that maximizes only "immediate reward". How both parameters affect GREP's performance is a
topic for future research.

The above suggests that GREP can be viewed as a generalization of "one-step-look-ahead"
policy improvement. In fact, a "one-step-look-ahead" improvement rule can be obtained for
$\gamma_z \to 0$ simply by taking $z = s_t$ in Eq. 13. Such an approach would be "policy greedy" in the sense
that it updates the policy only locally. We expect GREP to perform better because it updates
the policy more globally; whether this in fact improves GREP is also a remaining issue for future
research.

The interleaving of GREP with a Kalman-like estimation procedure of the environment could
handle a variety of interesting problems such as planning in POMDP environments.

We must mention that appropriate reparameterization of the stochastic policy, e.g. using a
Boltzmann distribution, could improve the convergence. We have not pursued this further.



Summary

We have introduced a learning method called "gradient-based reinforcement planning" (GREP).
GREP needs a model of the environment and plans ahead to improve its policy before it actually
acts in the environment. We have derived formulas for the exact policy gradient.

Numerical experiments suggest that the probabilistic policy indeed converges to an optimal
policy, but quite slowly. We found that (at least in our toy example) the optimal solution can be
found much faster by annealing or simply by taking the most probable action at each state.

Further work will be to incorporate GREP in online RL tasks where the environment
parameters, i.e. transition probabilities $T_{kji}$, are unknown and have to be learned. While an
analytical solution for q and z is only viable for small problem sizes, for larger problems we
probably need to investigate Monte Carlo or DP methods.
Appendix A: Implicit policies

In deterministic environments where the state-action pair $(s_i, a_m)$ uniquely leads to a state $s_j$, i.e.
$T_{kji} = \delta_{km}$ ($k = 1, \ldots, K$), the projection F is solely determined by the policy P, and vice versa. We
refer to this as the case of implicit policy because the policy is implied by the induced
transition probability $F_{ji}$.

In such environments it suffices to solve for F directly and omit the parameterization through
P. From Eq. 1 we see that

$$P_{mi} = F_{ji} \qquad (22)$$

and using a similar derivation as we have done for $\partial H / \partial P_{ki}$, it can be shown that the gradient of
H with respect to F is given by

$$G \propto q z^T. \qquad (23)$$

An important point must be mentioned. In most cases many elements $F_{ji}$ are zero, representing
an absent transition between $s_i$ and $s_j$. Naively updating F using the full gradient G would incur
complete fill-in of F, which is in most cases not desirable or even physically incorrect. Therefore,
one must check the gradient each time and set impossible transition probabilities to zero. We
will refer to this "heuristically corrected" gradient as $\tilde{G}$. Also, after each update, we have to
renormalize the columns of F. The rank-one update in Eq. 23 is interesting because it provides
an efficient means of calculating the inverse in Eq. 2.
Figure 3: MC calculations of a 6 x 6 toy maze. The agent is at (0,1) and targets (5,4). (a)
Expected occupancy, (b) adjoint probability, and (c) normalized policy gradient for n = 20, n = 40,
n = 200. Each vector is computed as $v_i = \sum_k G_{ki} e_k$, where $e_k$ is the unit vector along the state
change induced by action $a_k$.

Appendix B: Monte Carlo gradient sampling 

In our example, we calculated z and q in Eq. 13 by linear programming. For large state spaces the
matrix inversion quickly becomes too computationally intensive, and traditional dynamic
programming based methods would probably be more efficient.

Instead, we investigated the use of Monte Carlo (MC) simulation. We use forward sampling to
approximate the expected state occupancies in z and use so-called adjoint Monte Carlo
sampling (?) to estimate the adjoint reward q. Adjoint MC simulation of q is far more efficient than
estimating each $q_i$ by a separate MC run.⁶ By performing the simulation backward
from r, we obtain all values of q using only a single MC run.
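
The forward-sampling half of this scheme can be sketched as below; the adjoint (backward) sampling of q is not shown. The sketch assumes a column-stochastic F, a deterministic start state, and a truncated horizon. The 3-state matrix, the number of runs and the function name are hypothetical choices, and the sampling details are not taken from the paper.

```python
import numpy as np

def mc_occupancy(F, start, gamma, n_runs=5000, horizon=100, seed=0):
    """Forward-sampling estimate of z = sum_t gamma^t F^t s0 with s0 = e_start."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    cdf = np.cumsum(F, axis=0)                 # column-wise CDF over next states
    states = np.full(n_runs, start)            # all trajectories begin in `start`
    z = np.zeros(n)
    for t in range(horizon):
        # Accumulate the discounted empirical state distribution at time t.
        z += gamma**t * np.bincount(states, minlength=n) / n_runs
        # Inverse-CDF sampling of the next state for every trajectory at once.
        u = rng.random(n_runs)
        states = np.minimum((u[None, :] > cdf[:, states]).sum(axis=0), n - 1)
    return z

# Hypothetical column-stochastic projection matrix.
F = np.array([[0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 1.0]])
gamma = 0.9

z_mc = mc_occupancy(F, start=0, gamma=gamma)
z_exact = np.linalg.solve(np.eye(3) - gamma * F, np.eye(3)[0])
print(z_mc.round(3), z_exact.round(3))
```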

Fig. 3 shows the MC approximations of z and q. On the right of the same figure, we have plotted
the computed policy gradient based on MC estimates using a minimum number of n = {20, 40, 200}
samples. To compare them with the exact gradient, we calculated the exact values of z and q by
inverting the linear system. For larger numbers of samples, the gradient vectors do indeed point
more strongly towards the goal.

An important feature of general Monte Carlo methods is that they automatically concentrate
their sampling on the important regions of the parameter space, mostly in proportion to the
posterior or the likelihood. For our purpose of sampling the gradient, we have tried to apply
annealing to concentrate the sampling density even more towards the regions of large gradient values.
To sample from a density $p(\theta)$ we may sample from the annealed function $p_\gamma(\theta) = p(\theta)^\gamma / \int p(\theta)^\gamma \, d\theta$
and reweight each sample with its importance weight $1 / p_\gamma(\theta)$. For $\gamma \to \infty$, the set of samples
converges to the maximum probable gradient.

In conclusion, our approach of separately estimating q and z using MC and then multiplying
their solutions elementwise did not bring clear advantages. If we could sample from the joint
distribution $q_i z_i$ (i.e. the elementwise product), then MC would clearly turn out to be a very efficient
method.



6 With one MC run we mean performing, e.g., 10000 trials from a fixed state.



